A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset

Background Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. Results Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy with comparable results with previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a “subpopulation aware” 16-genome rice reference panel with ~ 3000 resequenced rice accessions. The entire process took ~ 16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~ 2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). Conclusions This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment. Supplementary Information The online version contains supplementary material available at 10.1186/s12915-024-01820-5.


Background
Single-nucleotide polymorphisms (SNPs) are one of the most common types of genetic variation (e.g., SNPs, insertions, deletions, copy number variations, and inversions) used to study genetic diversity among living organisms [1,2], and are routinely detected by mapping resequencing data to reference genomes using various software tools [3][4][5].In major crops, SNPs are routinely discovered using genome resequencing or array-based hybridization methods on thousands of accessions as documented for rice [6,7], maize [8], soybean [9], and sorghum [10].In order for such data to be used more widely for trait discovery, genomic selection, and functional genomics applications, numerous databases have been developed for crop plants such as, e.g., SNP-Seek [11], ViceVarMap [12], MaizeSNPDB [13], and Rice-Navi [14].Unfortunately, as crop communities continue to improve their flagship genome assemblies, as well as produce multiple new assemblies that take into account population structure [15][16][17], and other factors, it is becoming more onerous for such databases to keep pace with the onslaught of new data coming online.
The Genome Analysis Toolkit (GATK) [18,19], one of the most popular software tools developed for SNP identification, has been widely used for SNP detection for many species [9,20], and was recently modified to identify copy number variants (CNVs) in human [21].Although vast amounts of resequencing data have been processed using GATK [6,9,[20][21][22], the processing speed of the publicly available open-source version(s) can be very time-consuming when very large resequencing data sets are involved.For example, it took our consortia almost 6 months to call SNPs with GATK using ~ 3000 resequenced rice accessions mapped to a single reference genome.Although several commercially and publicly available workflows (e.g., Sentieon [23], Clara Parabricks [24], Falcon [25], DRAGEN-GATK [26]) are now available that accelerate GATK processing times, all require special and expensive hardware (e.g., graphics processing units, GPUs; field-programmable gate arrays, FPGAs) and are normally not suitable for processing large population datasets.
To address the need to detect genetic variation on the almost daily release of high-quality genome assemblies we have identified three challenges that must be solved to meet the demand for speed and efficiency of SNP detection.First, the exponential increase in sequencing and resequencing data requires intelligent data management solutions [23][24][25] and compressed data formats to reduce storage [26,27]; second, data analysis needs flexible workflows and monitoring tools for highthroughput detection and debugging [28]; and third, modern high-performance computing (HPC) architectures are needed to complete jobs efficiently [29,30].
To address these challenges, we designed a flexible workflow and employed high-performance computing (HPC) architectures to develop an open-source genome variant calling workflow for GATK (i.e., HPC-GVCW).The workflow was divided into four phases that include a data parallelization algorithm -"Genome Index splitter" (GIS) [31] -that divides genomes into megabase (MB) size chunks for parallel GATK processing and file merging.By dividing genomes into 45 Mb, 10 Mb, and 5 Mb chunks, we found that the smallest chunk size tested gave the optimal performance.Using HPC-GVCW with a chunk size of 5 Mb enabled us to call SNPs from ~ 3000 resequenced rice accessions (with 17 × genome coverage) on a single rice genome (GS ~ 400 Mb) in 120 h, which is almost ~ 36 times faster than previously reported (~ 6 months).
To demonstrate utility, we ran HPC-GVCW on a 25 crop genomes dataset using publicly available resequencing data sets and the most up-to-date (near) gap-free reference genome releases available and called an average of 27.3 M, 32.6 M, 169.9 M, and 16.2 M SNPs for rice (GS ~ 400 Mb), sorghum (GS ~ 700 Mb), maize (GS ~ 2400 Mb), and soybean (GS ~ 1100 Mb), respectively.
To demonstrate the novelty of the genetic variation discovered, our analysis of SNP datasets from a 16-genome "subpopulation-aware" rice reference panel revealed a total of ~ 2.3 M (8.8%) novel SNPs in total that have yet to be publicly released based.Analysis of these novel SNPs identified 1.3 M SNPs in genes, 20% (i.e., 248,403) of which are predicted to have impacts on gene function.Analysis of open chromatin regions (OCRs) of one accession (i.e., Zhenshan 97) revealed the presence of 7441 novel SNP that may have effects on gene regulation.Finally, in a test case to evaluate the allele status of known agriculturally important genes, we identified 180 accessions that contain the submergence tolerant allele in the Sub1A gene that could be integrated into accelerated breeding programs.

HPC-GVCW development
HPC-GVCW was designed into four phases: (1) mapping, (2) variant calling, (3) call set refinement and consolidation, and (4) variant merging (Fig. 1).Briefly, Phase 1 was designed to map clean resequencing reads to a reference genome.Phase 2 was designed to call variants using GATK for each sample.Phase 3 was designed to merge all variants per sample into a non-redundant joint genotype file by genome-wide intervals (also called "chunks").Phase 4 was designed to generate a genomewide joint genotype by assembling all variant intervals (detailed in Additional file 1: Automated Genome Variant Calling Workflow Design [32][33][34][35]; Additional file 2: Fig. S1; Additional file 3: Table S1).The GVCW workflow was designed to run on high-performance computers (Fig. 1a); however, it can also be employed on alternative computational platforms, including hybrid clusters and high-end workstations (Fig. 1b).Of note, each of the four phases is independent of one another, flexible, and scalable across multiple nodes and platforms (Additional file 1: Workflow flexibility).
With this workflow, the most challenging component to address was the merging of large sample sets (e.g., 3000 rice accessions) into a joint file using GATK with a single node, i.e., Phase 3. To address this challenge, we modified the "genome intervals joint genotype" module supported by GATK ("CombineGVCFs" and "GenotypeGVCFs, " detailed in Additional file 1: Automated Genome Variant Calling Workflow Design) by adding an algorithm called "Genome Index Splitter" (GIS) [31] that can optimize the size and number of genomics intervals utilized.The GIS algorithm creates a "chromosome split table" (CST) to index disjoint variant intervals, which can be fine-tuned based on genome size and available "central processing units" (CPUs) (Additional file 2: Fig. S1c-d).Optimal chunks are calculated based on three steps: (1) locate the largest chromosome length in a given reference genome; (2) calculate the fairness of a divisible integer for a given maximum number of cores; and (3) whole genome reference sequences are divided by the optimal integer number, as illustrated in Additional file 2: Fig. S1e.
For example, the CST with the entries as follows: < chromosome name (ChrName), chunk number (Chunk_no), chromosome starting position (Start), chromosome end position (End) > (ChrName, Chunk_no, Start, End).Once chunk size is optimized, jobs (both GATK's "CombineGVCFs" and "GenotypeGVCFs" functions) can be distributed and parallelized by chunks (Additional file 2: Fig. S1f-g).Leveraging this algorithm ensures that the creation of disjoint variant intervals is optimized based on genome size and computational resources, thereby preventing the underutilization of resources and the reduction of execution times.

HPC-GVCW benchmarking
To evaluate the precision of SNP identification of GVCW, we initially assessed the workflow across three computational platforms -i.e., supercomputer, clusters, and high-end workstations, using a subset of The 3000 Rice Genome Project (3 K-RGP) dataset [6] (n = 30) mapped to The International Rice Genome Sequencing Project (IRGSP) Reference Sequence (RefSeq) [36].We observed a 93.8-94.3%identical call rate across the three platforms and a 83-94% identical call rate when compared with previously published results [37] (Additional file 2: Fig. 2a).
We further compared the efficiency of total CPU hours (i.e., the execution time if all jobs were operated between the standard and genome chunk strategies) between GATK and HPC-GVCW.For GATK, a total of 304 CPU hours was required (9.5 h × 1 node × 32 cores/node) (Fig. 2b), vs. HPC-GVCW which ranged from 63 (chunk size = 5 Mb, nodes = 8) to 2511 (chunk size = 10 Kb, nodes = 2342) hours when using different chunk size/ node combinations (Fig. 2b).This equates to a maximum of 4.8 times more efficient, to 8 times less efficient as Fig. 1 Automated and flexible genome variant calling workflow (GVCW) design for a HPC systems and b diversified system architectures compared to the standard GATK approach, respectively.Of note, we found that the number of CPU hours either increased or decreased at chunk sizes greater or less than 5 Mb when using HPC-GVCW, which is recommended when using the workflow.
Overall, our results reveal that execution time can be reduced by a maximum of 283 times when the smallest genome interval is set to 10 Kb/chunk, and CPU efficiency could be improved 4.8 times using a genome interval set to 5 Mb/chunk for HPC-CVCW, as compared with GATK.

HPC-GVCW benchmarking for multiple crop species
To test if HPC-GVCW could be widely used across multiple crop species, we re-called SNPs using previously published resequencing/reference genome data sets for rice, sorghum, maize, and soybean (Additional file 3: Table S1 and Availability of data and materials for details).Using KAUST's Shaheen 2 supercomputer with 30 K cores, processing 3,024 resequenced samples (3 K-RGP) mapped to a single rice reference genome took 94 h (i.e., 3.91 days) (Additional file 3: Table S2).For the sorghum, maize, and soybean data sets, due to the small number of samples, we only benchmarked HPC-GVCW on a hybrid cluster with 3000 cores and found that even for a 2.4 Gb maize genome [38], SNP calling for 282 samples could be completed within ten days (Additional file 3: Table S2).Our benchmarking test identified 26.5 M, 32.7 M, 167.6 M, and 15.9 M SNPs for rice (IRGSP-1.0),sorghum (BTx623), maize (B73 v4), and soybean (Gmax 275 v2.0), respectively (Table 1 and Additional file 3: Table S3).To assess the accuracy of the SNP calls produced through HPC-GVCW compared with previous reports, we found that 86.3% of the rice (22.8 M) and 89.3% of the sorghum (29.2 M) SNPs were identical (Additional file 2: Fig. S2c-d and Additional file 3: Table S3).For maize, only 25% of the SNP calls overlapped which was likely due to the software and strategy used for SNP calling and filtering [39].For soybean, a direct comparison was not possible due to lack of data availability.

HPC-GVCW at production scale -a 25-genome SNP dataset for multiple crop species
Since the majority of publicly available SNP data for major crop species have yet to be updated on the recent wave of ultra-high-quality reference genomes coming online, we applied HPC-GVCW to call SNPs, with the identical large resequencing datasets, on the most current and publicly available genome releases for rice (i.e., the 16 genome Rice Population Reference Panel) [15,40,41], maize (B73 v4, B73 v5, and Mo17v2) [16,42], sorghum (Tx2783, Tx436, and TX430) [43], and soybean (Wm82 and JD17) [44].

Novel SNPs in rice
Having the ability to map large-scale resequencing datasets rapidly (e.g., 3 K-RGP) to multiple genomes (e.g., the 16-genome RPRP dataset), HPC-GVCW opens the possibility to discover and rigorously interrogate population-level pan-genome datasets on multiple scales -i.e., pan-genome, genome and single gene scale.

Pan-genome scale
Our analysis of the 3 K-RGP dataset [6] S3, and Additional file 3: Table S5).We found that an average of 36.5 Mb of genomic sequence is absent in a single rice genome but is present in at least one of the other 15 RPRP data sets, which is equivalent to ~ 2.1 M SNPs (Fig. 3, Additional file 2: Fig. S3, and Additional file 3: Table S5).For example, when considering the flagship reference genome for rice, i.e., the IRGSP RefSeq [36], a total of ~ 36.6 Mb of genomic sequence is completely absent in the IRGSP RefSeq but is found spread across at least one of the 15 genomes (~ 2.43 Mb/genome), and includes ~ 2.3 M previously unidentified novel SNPs (Fig. 3, Additional file 3: Table S5).
To validate these potentially functional SNPs, we measured the frequency of all 248,403 SNPs across the 3 K-RGP data set as shown in Additional file 2: Fig. S5a.The results show that 76.31% (189,564) of these putative functional SNPs could be identified within three or more rice accessions, thereby confirming the presence and quality of these SNP variants.These results show that much of the collective rice pan-genomes remain to be explored for crop improvement and basic research.

Genome scale -Zhenshan 97 (ZS97)
Open chromatin regions (OCRs) are special regions of the genome that can be accessed by DNA regulatory elements [47,48].Chromatin accessibility (CA) of OCRs can affect gene expression, epigenetic modifications, and patterns of meiotic recombination of tissue cells that could lead to important regulatory effects on biology observation [49,50].For rice, using the IRGSP Ref-Seq and the 3 K-RGP dataset, we previously annotated 5.06 M variants that were located in OCRs, of which ∼2.8% (~ 142,000) were classified as high-impact regulatory variants that may play regulatory roles across multiple tissues [51].
To search for novel SNPs in OCRs that are not present in the IRGSP RefSeq, we scanned for SNPs in OCRs of ZS97, a Xian/Indica variety, as a test case.First, our analysis revealed that approximately 14.6% of the ZS97 genome contains OCRs across the 6 tissues investigated, i.e., flag leaf, flower, lemma, panicle, root, and young leaf (Additional file 2: Fig. S6).We then conducted an intersection analysis of identified OCRs (peak regions) with variant call format (VCF) files and discovered 3,303,820 SNPs located within OCRs of the ZS97 genome (Additional file 3: Table S7), of which 7,441 were novel (i.e., relative to the IRGSP RefSeq).This equates to 6.23% of the 1.19 M ZS97 novel SNPs discussed above (Additional file 3: Table S7).
To validate these SNP, we again measured SNP frequency across 3 K-RGP for all 7441 novel SNPs and found that 78.13% (5814) of these SNPs could be identified in three or more accessions (Additional file 2: Fig. S5b).
To assess the potential functional impact of these novel ZS97 SNPs, we established thresholds by selecting the top and bottom five percent of variation scores from previously scored SNP variation data across as reported [51], which led to the identification of 855 SNPs (Additional file 3: Table S6).Notably, these SNPs accounted for approximately 33.3% of the loci with significant variations, which are considered large-effect SNP loci with a greater impact on chromatin accessibility (CA).These results indicate the effects of physical access to chromatinized DNA, to binding, allowing for active gene transcription for novel SNPs.

Single gene scale -Sub1A
Many of the genes and SNPs identified in our pangenome variant analysis have yet to be tested for their contributions to agronomic performance and biotic and abiotic stresses.For example, prolonged submergence during floods can cause significant constraints to rice production resulting in millions of dollars of lost farmer income [52].One solution to flooding survival has been to cross the Sub1A gene, first discovered in a tolerant indica derivative of the FR13A cultivar (IR40931-26) in 2006 [52], into mega rice varieties such as Swarna, Sambha Mahsuri, and IR64 [52,53].Our analysis of the Sub1A locus across the pan-genome of rice showed that this gene could only be observed in 4 out of 16 genomes in the RPRP data set, including IR64 (Fig. 3c, d).Since Sub1A is absent in the IRGSP RefSeq, the genetic diversity of this locus can only be revealed through the analyses of reference genomes that contain this gene.Thus, we applied the IR64 reference as the base genome for SNP comparisons, and identified a total of 26 SNPs in the Sub1A locus across 3 K-RGP, 6 of which have minor allele frequencies (MAF) greater than 1% (Fig. 1F), including a previously reported SNP (7,546,665-G/A), which is also validated by 4 gene sequences, i.e., OsIR64_09g0004230, OsLima_09g0004190, OsGoSa_09g0004200, and OsARC_09g0004070.This variation resulted in a nonconservative amino acid change from serine (S, Sub1A-1, tolerance-specific allele) to proline (P, Sub1A-2, intolerance-specific allele) [52] (Fig. 3e, f ).The majority of accessions in the 3 K-RGP data set (i.e., 2173) do not contain the Sub1A gene, while 848 do, 668 of which (22.11%) have the Sub1A-2 allele, while 180 accessions (5.96%) contain the Sub1A-1 allele (Fig. 3f ).Understanding the genetic diversity of the Sub-1A gene at the population level helps us understand and filter variants that are predicted to show flooding tolerance across Fig. 3 Rice Population Reference Panel (RPRP) [15] pan-genome variant analysis.a Circos plot depicts the distribution of genomic attributes along the IRGSP RefSeq (window size = 500 Kb).b Comparison of genomic attributes, i.e., genes, SNPs, Pi, and Theta on chromosome 9 across the 16 RPRP pan-genome data sets (window size = 10 Kb).c Rice Gene Index (RGI) comparison of the Sub loci across the 16 RPRP pan-genome data set.d Phylogenetic analysis of Sub1A, Sub1B, and Sub1C across the 16 RPRP pan-genome data set.e Amino acid alignment of the Sub1A gene across the RPRP.f Survey of SNPs within the Sub1A gene across the 3 K-RGP resequencing data set.This analysis revealed the genomic status of the Sub1A gene (presence/absence; submergence tolerance/intolerance) across the 3 K-RGP data set the 3 K-RGP, which could be further applied to precise molecular-assisted selection (MAS) breeding programs.In addition, such pan-genome analyses may also reveal new variants that could provide valuable insights into the molecular mechanisms of flooding tolerance.

Discussion
With the ability to produce ultra-high-quality reference genomes and population-level resequencing data -at will -accelerated and parallel data processing methods must be developed to efficiently call genetic variation at scale.We developed a publicly available open-source high-performance (CPU-based) computing pipeline (HPC-GVCW) that is supported across diversified computational platforms, i.e., desktops, workstations, clusters, and other high-performance computing architectures.In addition, HPC-GVCW was containerized for both Docker [54] and Singularity [55] for reproducible results without reinstallation and software version incompatibilities.
Comparison of SNP calls on identical data sets (i.e., rice 3 K-RGP to the IRGSP RefSeq and 400 samples from Sorghum Association Panel to the BT623v3.1)yielded similar results, however, run times could be reduced from more than six months to less than one week, as in the case for rice 3 K-RGP [6].The GVCW pipeline enabled the rapid identification of a large amount of genetic variation across multiple crops, including sorghum, maize, and soybean on the world's most up-to-date, high-quality reference genomes.These SNPs provide an updated resource of genetic diversity that can be utilized for both crop improvement and basic research, and are freely available through the SNP-Seek [56], Gramene web portals [57], and KAUST Research Repository (KRR [58]).
Key to our ability to rapidly call SNPs on a variety of computational architectures lies in the design of the HPC environment and the distribution of work across multiple nodes.Our next steps will be to apply GVCW on improved computing platforms, e.g., KAUST Shaheen III with unlimited storage and file numbers, 5000 nodes, faster input and output (I/O), and tests on larger forthcoming data sets [59].In addition to GATK, other SNP detection strategies such as the machine learning-based tool "DeepVariant" [3], which shows better performance in execution times with human data [5], have yet to be widely used in plants.With a preliminary analysis of the rice 3 K-RGP dataset, "DeepVariant" identified a larger number of variants at a similar or lower error rate compared to GATK [60].To test how artificial intelligence (AI) can be used to improve food security by accelerating the genetic improvement of major crop species, we plan to integrate "DeepVariant" into our HPC workflow to discover and explore new uncharacterized variation.
In addition, we also plan to apply similar pan-genome strategies on more species beyond rice, sorghum, maize, and soybean to discover and characterize hidden SNPs and diversity, which could provide robust and vital resources to facilitate future genetic studies and breeding programs.

Conclusions
We developed HPC-GVCW for variant calling in major crops, which can reduce execution times > 280 fold, as well as increase efficiency > 4.8 fold as compared with the GATK 'best practice' workflow [19].A new algorithm ("Genome Index splitter") for running 'CombineGVCFs' was designed to parallelize this step and was found to be 19 times faster than available default options.We demonstrated that the entire workflow can be used on a variety of computing platforms, such as hybrid clusters, and high-end workstations using Docker and Singularity images.Using HPC-GVCW, we called population panel variants for the latest high-quality genome references and created 25 immediately applicable datasets with an average of 27.3 M, 32.6 M, 169.9 M, and 16.2 M SNPs for rice (16 population panel references), sorghum (4), maize (3), and soybean (2), respectively.Analysis of a 16-genome rice reference panel revealed ~ 2.3 M novel SNPs relative to the IRGSP RefSeq, which equates to an approximate 8% overall increase in SNP discovery that can be applied immediately to precise molecular-assisted selection (MAS) breeding programs and functional analyses.

SNP annotation
SNPs located in coding regions across the 25 reference genome data sets were identified using their respective annotation files, and functional SNPs were predicted using SnpEff (v5.0e) [45].

Structural variation (SV) update across the 16-genome rice population reference panel (RPRP)
In 2020, we published an index of large structure variations (> 50 bp, SVs) across the 16-genome RPRP that included the MH63RS2 and ZS97RS2 genome assemblies [40].Here, we updated this index using the latest gap-free genome assemblies for these genomes -i.e., MH63RS3 and ZS97RS3 [65] -using the same methods as previously described.To validate this updated SV index, we randomly selected 50 insertions and 50 deletions across the 16 rice genome (RPRP), using the IRGSP RefSeq as the reference and the remaining 15 rice genomes as queries, which included a total of 1,500 entries ((50 + 50) × 15 = 1500).
We then manually validated each SV with alignment information in the Integrative Genomics Viewer (IGV) using raw reads and alignment blocks with Nucmer [66].SVs were considered valid if the two methods could identify the identical insertion or deletion and resulted in 94.6% of the insertions and 99.3% of the deletions being validated as true SVs.

Homologous gene identification across the 16-genome Rice Population Reference Panel (RPRP) based on sequence alignment and syntenic position
As with SVs above, we also updated our rice gene index (RGI) using updated MH63RS3 and ZS97RS3 gene annotations with identical pipeline [41].Briefly, homologous gene sets across the 16-genome RPRP were identified using GeneTribe software [67], by combining protein sequence similarity and collinearity (i.e., synteny) information.Homologous relationships included "reciprocal best hits" (RBHs), "single-side best hits" (SBHs), one-to-many, and singletons.Based on the one-to-one relationships (both RBH and SBH), and considering the collinearity blocks, we removed redundant homologous gene groups to obtain 79,111 non-redundant homologous gene groups.Finally, these non-redundant homologous gene groups were clustered with the "Connected Graph Algorithm" [68] to obtain 41,137 homologous gene groups.

Rice pan-genome SNP analysis
Using the updated SV and RGI data sets in combination with the 16-genome RPRP SNP data set, we conducted a pan-genome SNP analysis to classify genomic regions into core, dispensable, genome-specific, and genome-absent regions [69].Core regions are defined as sequences that are present in all 16 RPRP genomes.Dispensable regions are defined as sequences that are observed in 2 to 15 of the 16 RPRP genomes.Genomespecific regions are defined as sequences that are present in only one of the 16 RPRP genomes, but absent in the remaining 15.Genome-absent regions are defined as sequences that are not present in one of the 16 RPRP genomes, but are present in at least one of the other 15 genomes.For the presence and absence of genes, we classified homologous gene groups as core, dispensable, specific, and absent genes, representing the same logic flow as large SVs.Bedtools (v2.30.0) [70] subcommand "subtract" was used for core region identification, and the subcommand "intersect" was used for SNP extraction.

Chromatin accessibility of novel SNPs in open chromatin regions
Accessible Chromatin, combined with high-throughput sequencing (ATAC-seq) is widely used as one of the mainstream OCR detection methods [51,71,72].In this study, ATAC-seq data from 6 tissues of ZS97RS3, i.e., flag leaf, flower, lemma, panicle, root, and young leaf were obtained from NCBI BioProject PRJNA705005 [73].In the initial steps of analyzing raw ATAC-seq data, we conducted quality control using FastQC [74].This quality control process involved evaluating the quality of sequenced bases, average GC content, and the presence of repetitive sequences.Notably, we observed variations in the content of the first four bases at the 5′ end of each sample.To address this issue, we further refined our data by using fastp (v0.12.4) [75] to remove low-quality data and trim 20 base pairs from the 5' ends.Subsequently, we employed BWA's mem [76] algorithm to align the sequencing data with the ZS97RS3 rice genome while filtering out reads that mapped to mitochondrial and chloroplast DNA.Peak regions of open chromatin regions (OCRs) within the ATAC-seq data were identified using MACS2 [77] with specific parameters: "-shift -100 -extsize 200 -nomodel -B -SPMR -g 3.0e8 -call-summits -p 0.01." Following this, peak call results from each individual sample were combined using BEDtools (v2.26.0) [70] with default settings for merging.
To assess the potential functional impact of novel SNPs, we employed the intragroup Basenji model to study their variation scores [51].Based on the Basenji model training, we predict the effect of variation in different tissues on chromatin accessibility (CA) in neighboring genomic regions.For each variation, we construct two sequences that contain the mutation site and the sequences around it, differing only at the mutation site.We then predict CA in each of these two sequences and score the effect of variants by comparing the CA differences between the two genotypes in the 1 kb region around the mutation site.The higher the score of the SNP, the greater the effect on CA in open chromatin regions.

Fig. 2
Fig.2Benchmarking of the Phase 3 GIS parallelization HPC-GVCW as compare with the standard GATK pipeline using 30 resequenced rice accessions mapped to a single reference genome, a execution time and b CPU hours (execution time × number of nodes) for job completion.Notes: Comparisons were tested between the standard GATK pipeline without chunks using 1 node (blue dots), and HPC-GVCW using a range of computing nodes chunked length combinations, i.e., chunks sizes of 10 Kb, 100 Kb, 200 Kb, 500 Kb, 1 Mb, 5 Mb, 10 Mb, 20 Mb, and chromosome level, which use 2342, 237, 120, 50, 27, 8, 6, 5, and 4 nodes, respectively (yellow dots)

Table 1
Number of SNPs identified across four major crop species using their most recent public genome releases

Table 1 (continued) Species Reference genome Acronyms GenBank ID Number of SNPs SNPs in exons SNPs in 3′ UTR SNPs in 5′ UTR 5′ UTR premature start codon gain variant Missense variant Start lost Stop gained Stop lost
[15]ed to the 16-genome RPRP dataset[15]revealed a core genome of 314.1 Mb, an average dispensable genome of 56.55 Mb, and a private genome of ~ 745 Kb/genome (see Methods for definitions), that contain ~ 22.4 M, 3.2 M and 33.8 K SNPs, respectively (Additional file 2: Fig.